How complete is the CDC's COVID-19 case surveillance data for race/ethnicity at the state and county levels?

Status: Draft

Feb 1, 2021

In [ ]:
#@title
import pandas as pd
import altair as alt
from vega_datasets import data

from google.colab import auth
auth.authenticate_user()

# Turn off the three-dot menu for Altair/Vega charts.
alt.renderers.set_embed_options(actions=False)
#%load_ext google.colab.data_table
In [225]:
#@title
def FieldAnalysis(project_id, table, field_list):
  dict = {}
  for field in field_list:
      dict[field] = [0.0, 0.0, 0.0, 0.0]
  unknowns = pd.DataFrame(dict, index=['Unknown', 'Missing', 'NA', 'Known'])
  field_series = []
  value_series = []
  percent_series = []

  for field in field_list:
    field_unknowns_query = ('''
    SELECT
      %s,
      count(*) as cases
    FROM
      %s
    GROUP BY
      %s
    ''')
    query = field_unknowns_query % (field, table, field)
    field_unknowns_df = pd.io.gbq.read_gbq(query, project_id=project_id)
    field_unknowns_df.set_index(field, inplace=True)
    field_unknowns_df.index = field_unknowns_df.index.fillna('Null')

    field_display_name = {
        'cdc_case_earliest_dt': 'CDC earliest case date',
        'current_status': 'Case status',
        'res_state': 'State',
        'res_county': 'County',
        'sex': 'Sex',
        'age_group': 'Age',
        'race_ethnicity_combined': 'Race/Ethnicity'}

    missing_count = 0
    if 'Missing' in field_unknowns_df.index:
      missing_count += field_unknowns_df.loc['Missing'].cases
    if 'Null' in field_unknowns_df.index:
      missing_count += field_unknowns_df.loc['Null'].cases
    #if field_unknowns_df.index.isnull().any():
    #  missing_count += field_unknowns_df.loc[field_unknowns_df.index.isnull()].cases
    unknowns.loc['Missing', field] = missing_count / field_unknowns_df.cases.sum()

    if 'Unknown' in field_unknowns_df.index:
      unknowns.loc['Unknown', field] = field_unknowns_df.loc['Unknown'].cases / field_unknowns_df.cases.sum()
    if 'NA' in field_unknowns_df.index:
      unknowns.loc['NA', field] = field_unknowns_df.loc['NA'].cases / field_unknowns_df.cases.sum()
    unknowns.loc['Known', field] = 1 - (unknowns.loc['Missing', field] +
                                        unknowns.loc['Unknown', field] +
                                        unknowns.loc['NA', field])
    field_series.extend([field_display_name.get(field, field)] * 4)
    value_series.extend(['Known', 'Supressed', 'Unknown', 'Missing'])
    percent_series.extend([unknowns.loc['Known', field],
                           unknowns.loc['NA', field],
                           unknowns.loc['Unknown', field],
                           unknowns.loc['Missing', field]])
  test = pd.DataFrame.from_dict({'field': field_series,
                               'value': value_series,
                               'percent': percent_series})
  return alt.Chart(test).mark_bar().encode(
      x=alt.X('percent', axis=alt.Axis(format='%'), title=''),
      y=alt.Y('field', sort='x', title='Field'),
      color=alt.Color('value', scale=alt.Scale(scheme='category20'), title='Value'),
      order=alt.Order('field:N'),
      tooltip=[
                  alt.Tooltip('field:N', title='Field'),
                  alt.Tooltip('value:N', title='Value'),
                  alt.Tooltip('percent:Q', format=',.0%', title='Percent'),
      ]
  )

Background

The racial and ethnic disparities in the COVID-19 pandemic have exposed longstanding health inequities in the U.S., which have been described in multiple analyses of COVID-19 data by the Covid Tracking Project, New York Times, American Public Media Research Lab, Kaiser Family Foundation, and National Public Radio among many others.

Unfortunately, we still don't have a full understanding of the degree of these disparities because of the fragmented landscape of race/ethnicity data. On Jan 29, the Covid Tracking Project wrote "the continued lack of either complete federal demographic data or federal guidelines for what states should publish make it impossible to fully understand who is being infected with and dying of COVID-19" and that we are still missing race information for about a third of COVID-19 cases. The most reliable and up-to-date data is scattered across individual state public health websites that use different standards and categories for reporting race/ethnicity. Collecting this data has largely been left to non-governmental organizations like the Covid Tracking Project, which announced that they will stop collecting data on March 7, 2021, a full year after they started. Even the federal government has looked to the Covid Tracking Project for reliable data on race/ethnicity for COVID-19 data. The office of the Assistant Secretary for Planning and Evaluation, an agency within Health and Human Services, wrote in Oct 2020 that "The volunteer-based COVID tracking project has created the most comprehensive centralized resource for race and ethnicity data at the state level."

For case data in particular, more so than deaths data, the CDC has only published public race/ethnicity data at the U.S. level, not the state or county levels. In July 2020, the New York Times published The Fullest Look Yet at the Racial Inequity of Coronavirus, a one-time analysis of CDC data obtained via FOIA and legal action that contained county-level case data with race/ethnicity up to May 28, 2020. Several non-governmental organizations have taken it upon themselves to gather data for total case counts at the county level (New York Times, Johns Hopkins University, USAFacts), but none of them have collected race/ethnicity data; this would be a huge undertaking due to the non-standard reporting of race/ethnicity across state and local public health websites.

In Nov 2020, the CDC made some of the case data that the New York Times obtained public: county-level totals in a dashboard and public data about race/ethnicity with additional dashboards, but without state and county details. They also released restricted access data with race/ethnicity, state, and county available upon request. The CDC's initial restricted access data agreement did not allow for county-level analyses to be made public, but an updated data agreement from Dec 14, 2020 allowed such public analyses. In Jan 2021, the Morehouse School of Medicine's Satcher Health Leadership Institute (MSM/SHLI) in collaboration with Google.org applied for and got access to this data.

The CDC Restricted Access data enabled us to complete the first analysis of race/ethnicity disparities across the U.S. at the county level since the New York Times analysis in July. However, the data has significant completeness issues; e.g., only 55% of cases have known race/ethnicity, which translates to us not knowing the race/ethnicity of 6M people who got COVID-19 out of 13.4M total up to Dec 16, 2020.

For the 7.4M cases where we do know race/ethnicity, we can see the following disparities across race/ethnicity groups:

In [226]:
#@title

test = pd.DataFrame.from_dict({'group': [
                                         'Black', 'Hispanic/Latino', 'White', 'Asian/NHPI', 'AIAN', '-Total-',
                                         ],
                               'percent': [.0234, .0254, .0211, .0129, .0376, .0409]})
alt.Chart(test).mark_bar().encode(
      x=alt.X('percent', axis=alt.Axis(format='.1%'), title=''),
      y=alt.Y('group', sort='-x', title=''),
      color=alt.Color('group', 
                      scale=alt.Scale(scheme='category20'),
                      title=''),
      order=alt.Order('percent:N'),
      tooltip=[
                  alt.Tooltip('group:N', title='Race/Ethnicity Group'),
                  alt.Tooltip('percent:Q', format='.2%', title='Cases in race/ethnicity group'),
      ]
).properties(
    title='Percent of Race/Ethnicity Group in the U.S. who had COVID-19 as of Dec 16'
).display()

Note: AIAN: American Indian or Alaska Native; NHPI: Native Hawaiian or Pacific Islander.

But the chart above is based on incomplete data. If all 6M cases with missing race/ethnicity were Hispanic/Latino people, the percent of Hispanic/Latino people in the U.S. who had COVID-19 would go from 2.5% to 12.5%. If all 6M cases with missing race/ethnicity were Black people, the percent of Black people in the U.S. who had COVID-19 would go from 2.3% to 17.1%. While both of these scenarios are highly unlikely, they show why missing race/ethnicity data is keeping us from fully understanding and addressing the disparities in this pandemic.

Overview

The goal of this analysis is to assess the completeness of the CDC's Restricted Access data and its feasibility in examining disparities in race/ethnicity for COVID-19 cases at the state and county levels. We'll first give an overview of the tradeoffs between aggregate and case-level data. We will next compare the total case counts in the restricted access data to two reliable public aggregate datasets at the state and county levels. We will then compare cases with race/ethnicity at the state level to the Covid Tracking Project's Covid Racial Data Tracker data.

The top-level data completeness findings are:

  1. Data Overview: Most fields in the CDC's restricted access data are missing too many values to be useful. The only fields that we used were state, county, age, sex, and race/ethnicity. All other fields, including whether the person died or was hospitalized, are known for fewer than 50% of the cases. Race/ethnicity was only known for 55% of cases, as opposed to 97%-100% for all the other fields below.
In [227]:
#@title
field_list = ['cdc_case_earliest_dt', 'current_status', 'res_state', 'res_county', 'sex', 'age_group', 'race_ethnicity_combined']
project_id = 'msm-secure-data-1b'
table = '`msm-secure-data-1b.ndunlap_secure.cdc_restricted_access_20201231`'
FieldAnalysis(project_id, table, field_list).display()
  1. Total Case Counts: The CDC data contains 80% of the total cases reported in more reliable data sources on the same data. There's high variability at the state level, where the state with the biggest discrepancy has only 3% of the total cases in more reliable data sources. While it is expected that the CDC will lag in some cases, the time lag alone can't explain the discrepancies in some states and counties.
  2. Cases with Race/Ethnicity: Race/ethnicity data is available for 55% of cases in the CDC data compared to 66% in the Covid Racial Data Tracker. Race/ethnicity data availability is highly variable across different states, which is common to both the CDC and Covid Racial Data Tracker data, but the Covid Racial Data Tracker has more cases with race/ethnicity data than the CDC in all but four states.

Recommendations on how to improve state and county data:

  • Most states and counties need to report more information for race and ethnicity with something other than Unknown and make sure that it gets to the CDC. However, a few states are missing entire swaths of cases that point to larger issues.

What we didn't include in this report:

  • An analysis of data about deaths, which you can find in a separate report. Note, however, that fewer than 50% of cases have the field filled in that says whether the person died (see Appendix). Deaths data can also come from death certificates, reported in the the CDC Provisional Deaths data, and may be more complete than the case data we are looking at here (more details at the CDC site).
  • A disparities analysis; we did this in a separate disparities report that requires many caveats due to the incompleteness of the data.

Completeness Analysis

Data Overview

The case data comes from a case report form that is a dense, two-page form to get information about each lab-confirmed or probable COVID-19 case. The restricted access data contains 32 fields, which are described on the CDC website. The public version of the restricted access data contains 12 of those fields. The CDC has extensive FAQs about this surveillance data, one of which is about completeness:

How complete are the data that the CDC receives about COVID-19 cases?

The COVID-19 pandemic has put unprecedented demands on the public health data supply chain. In many states, the large number of COVID-19 cases has severely strained the ability of hospitals, healthcare providers, and laboratories to report cases with complete demographic information, such as race and ethnicity. The unprecedented volume of cases has also limited the ability of state and local health departments to conduct thorough case investigations and collect all requested case data.

As a result, many COVID-19 case notifications submitted to CDC do not have complete information on patient demographics [...] Because it can be time-consuming for jurisdictions to collect the additional information, these data can lag behind the aggregate counts. Because of missing data, analyses of these data elements are likely an underestimate of the true occurrence.

The CDC distinguishes between aggregate data that comes from state and local public health websites vs. line- or case-level data that comes to the CDC from public health departments. The CDC FAQs say that aggregate data is more accurate than case data:

Aggregate counts provide the most up-to-date validated numbers on cases and deaths.

Because it can be time-consuming for jurisdictions to collect the additional information, these data can lag behind the aggregate counts. Although CDC receives this information for most cases, it does not receive it for all cases.

Aggregate data from public health websites often does contain race/ethnicity details, but all state websites do not all use the same standard race/ethnicity categories. The Covid Racial Data Tracker captures the many non-standard ways in which different states report on race/ethnicity, where ethnicity is whether a person is Hispanic/Latino. Some states report race/ethnicity as a combined field where each race/ethnicity group is mutually exclusive, which is how the CDC case data reports this field. Other states report race/ethnicity as separate fields where Hispanic/Latino people are counted within different race groups as well as in a separate field for ethnicity. States can also differ in terms of which race categories they use, how they define them, whether multiracial people are counted multiple times in different categories, and what's included in the "Other" race category. For more details, see this Covid Racial Data Tracker analysis.

So, we will need to sacrifice the accuracy and timeliness of aggregate data to get standardized race/ethnicity reporting on cases across all states and counties. However, standardized reporting on race/ethnicity is only useful if we have that data in enough states and counties.

In the CDC case data, race/ethnicity is known for only 55% of cases. The cases without known race/ethnicity fall into the following categories:

  1. Marked as "Unknown" on the case report form (35%)
  2. Missing due to being left blank on the form (4%)
  3. Suppressed for privacy reasons for small geographic and/or demographic population groups (2%)

The CDC discussed the incompleteness of race/ethnicity data in their case data FAQs:

[...] in many states, the large number of COVID-19 cases has severely strained the ability to report cases with complete demographic information for race and ethnicity. With thousands of cases being reported, completeness of these elements is unlikely to improve in the immediate future for some jurisdictions.

Has this data gotten more complete since the New York Times obtained a copy of the case surveillance data in May, 2020? Based on the comparison table below, the data has improved in terms of more counties included and a higher percent of cases with race/ethnicity and county; however, some of those differences may be due to the fact that there are simply more counties with COVID-19 cases in the more recent data. The percent of cases included in the data has also improved as compared to totals from the Covid Tracking Project (CTP).

In [287]:
#@title
# Manually update these fields based on the latest CDC data.
row_names = [
    'Update frequency',
    'Data as of date',
    'Cases in CTP as of date',
    'Cases in data as of date',
    '(as a % of CTP)',
    'Number of counties',
    '(as a % of all counties)',
    'U.S. population in those counties',
    '(as a % of total U.S population – States + D.C.)',
    'Cases with known race/ethnicity and county',
    '(as a % of cases in data)',
    'Access'
]
nyt_cdc_metadata = [
    'Once',
    'May 28, 2020',
    '1.7M',
    '1.4M',
    '(88%)',
    '974',
    '(31%)',
    '178M',
    '(~55%)',
    '~0.6M',
    '(44%)',
    'Not public'
]
cdc_metadata = [
    'Monthly', 
    'Dec 16, 2020',
    '16.8M',
    '13.4M',
    '(80%)',
    '3,046',
    '(97%)',
    '324M',
    '(99.8%)',
    '7.2M',
    '(54%)',
    'Restricted'
]
table_data = {'NYT/CDC': nyt_cdc_metadata, 'CDC': cdc_metadata}
metadata_df = pd.DataFrame(table_data, index=row_names)
metadata_df.head(15)
Out[287]:
NYT/CDC CDC
Update frequency Once Monthly
Data as of date May 28, 2020 Dec 16, 2020
Cases in CTP as of date 1.7M 16.8M
Cases in data as of date 1.4M 13.4M
(as a % of CTP) (88%) (80%)
Number of counties 974 3,046
(as a % of all counties) (31%) (97%)
U.S. population in those counties 178M 324M
(as a % of total U.S population – States + D.C.) (~55%) (99.8%)
Cases with known race/ethnicity and county ~0.6M 7.2M
(as a % of cases in data) (44%) (54%)
Access Not public Restricted

Sources: NYT article and The Daily podcast episode about the article, CTP total case counts for the U.S. by date.

Total Case Counts

We will compare the CDC data against two sources of aggregate data: The Covid Racial Data Tracker (CRDT) and the NYT's public data, which are both updated on a regular basis (CRDT twice a week, NYT daily) and come from state and local public health websites. CRDT is the only source for case data with race/ethnicity breakdowns, but there are several sources for county-level aggregate case data in addition to the NYT, such as JHU and USAFacts (this paper analyzes the differences between those sources at the state level up to July for cases and deaths).

The table below compares geographic vs. race/ethnicity availability for these three different data sources:

  • NYT: New York Times COVID-19 Public Data
  • CRDT: Covid Racial Data Tracker Public Data
  • CDC: CDC Case Surveillance Restricted Access Data
In [229]:
#@title
row_names = [
    'Total Cases — States',
    'Total Cases — Counties',
    'Cases by Race/Ethnicity — States',
    'Cases by Race/Ethnicity — Counties'
]
nyt_yn = [
    '✅',
    '✅',
    '❌',
    '❌',
]
crdt_yn = [
    '✅',
    '❌',
    '✅',
    '❌',
]
cdc_yn = [
    '✅',
    '✅',
    '✅',
    '✅',
]
table_data = {'NYT': nyt_yn, 'CRDT': crdt_yn, 'CDC': cdc_yn}
availability_df = pd.DataFrame(table_data, index=row_names)
availability_df.head()
Out[229]:
NYT CRDT CDC
Total Cases — States
Total Cases — Counties
Cases by Race/Ethnicity — States
Cases by Race/Ethnicity — Counties

Because the CDC is the only data source that has race/ethnicity at the county level, the most similar data for purposes of comparison are (1) CRDT data at the state level with race/ethnicity, and (2) NYT data at the county level with no race/ethnicity.

We will compare across these data sources up to Dec 16, 2020, which is the latest reporting date in the CDC data. We expect to see differences in the case counts due to lags in reporting the data, but we don't expect that time lags can explain large percentages of missing cases.

Baseline: NYT vs. CRDT

To get a baseline of how much we could expect the CDC case counts to match the CRDT or NYT, we can see how closely the CRDT and NYT match each other. Each dot below is a state (hover to see details), and the black line shows where the NYT and CRDT case counts are equal.

In [230]:
#@title
CASES = 'Cases'
DATASET = 'cdc'
metric = CASES

project_id = 'msm-secure-data-1b'
cdc_table = '`%s.ndunlap_secure.cdc_restricted_access_20201231`' % project_id
date = 'DATE(2020, 12, 16)'
date_int = '20201216'
date_display_name = 'Dec 16'

# Chart settings.
total_cases_scale_max = 2000000
scatter_height = 300
scatter_width = 300
map_height = 300
map_width = 450
us_states = alt.topo_feature(data.us_10m.url, 'states')
us_counties = alt.topo_feature(data.us_10m.url+"#", 'counties')

territories = ('PR', 'GU', 'VI', 'MP', 'AS')
In [231]:
#@title
crdt_query = ('''
SELECT
  State as state,
  Cases_Total as crdt_cases,
  Cases_Total - Cases_Unknown as crdt_known_race_cases,
  ROUND(1 - Cases_Unknown / Cases_Total, 4) as crdt_known_race_cases_percent,
FROM `msm-secure-data-1b.ndunlap_secure.crdt`
WHERE
  date = %s
''' % date_int)

nyt_states_query = ('''
SELECT
  state_name,
  state_fips_code,
  confirmed_cases as nyt_cases,
  deaths as nyt_deaths
FROM `bigquery-public-data.covid19_nyt.us_states`
WHERE
  date = %s AND
  state_fips_code IS NOT NULL
''' % date)

nyt_counties_query = ('''
SELECT
  county_fips_code,
  confirmed_cases as nyt_cases,
FROM `bigquery-public-data.covid19_nyt.us_counties`
WHERE
  date = %s AND
  county_fips_code IS NOT NULL
''' % date)

cdc_states_query = ('''
SELECT
  res_state,
  COUNT(*) as cdc_cases
FROM
  %s
GROUP BY
   res_state
''' % cdc_table)

cdc_counties_query = ('''
SELECT
  res_state,
  res_county,
  race_ethnicity_combined,
  COUNT(*) as cases
FROM
  %s
GROUP BY
   res_county,
   res_state,
   race_ethnicity_combined
''' % cdc_table)

compare_cases_unknowns_query = ('''
SELECT
  res_state,
  race_ethnicity_combined,
  COUNT(*) as cdc_cases
FROM
  %s
GROUP BY
   res_state,
   race_ethnicity_combined
''' % cdc_table)
In [289]:
#@title
def CreateScatterPlot(
    chart_df, fields_dict, title, scale_max, height, width, geo, metric_type):
  
  geo_field = 'state'
  geo_field_display_name = 'State'
  if geo == 'county':
    geo_field = 'state_county'
    geo_field_display_name = 'County'

  if metric_type == 'ratio':
    scale_scheme = 'blueorange'
    scale_reverse = True
    scale_domain = [0, 2]
    legend_format = '.1f'
    axis_format = ',.0f'
  elif metric_type == 'percent':
    scale_scheme = 'redyellowblue'
    scale_reverse = False
    scale_domain = [0, 1]
    legend_format = '.0%'
    axis_format = '.0%'

  tooltips = [alt.Tooltip(geo_field + ':N', title=geo_field_display_name)]
  for field in ('y', 'x', 'percent'):
    tooltips.append(alt.Tooltip(
        fields_dict[field]['name'] + ':Q',
        format=fields_dict[field]['format'],
        title=fields_dict[field]['title'],
    ))
  plot = alt.Chart(chart_df).mark_circle(size=60).encode(
      alt.X(fields_dict['x']['name'] + ':Q', axis=alt.Axis(title=fields_dict['x']['title'], format=axis_format),
          scale=alt.Scale(domain=(0, scale_max))
      ),
      alt.Y(fields_dict['y']['name'] + ':Q', axis=alt.Axis(title=fields_dict['y']['title'], format=axis_format),
          scale=alt.Scale(domain=(0, scale_max))
      ),
      color=alt.Color(fields_dict['percent']['name'],
                      type='quantitative',
                      scale=alt.Scale(scheme=scale_scheme,
                                      reverse=scale_reverse,
                                      domain=scale_domain,
                                      clamp=True),
                      legend=alt.Legend(format=legend_format),
                      title=metric_type.capitalize()),
      tooltip=tooltips,
  ).properties(
      height=height,
      width=width,
  )
  if metric_type == 'ratio':
    plot.interactive()

  line = pd.DataFrame({
      'x': [0, scale_max],
      'y': [0, scale_max],
  })

  if metric_type == 'ratio':
    line_plot = alt.Chart(line).mark_line(color='black').encode(
        x='x',
        y='y',
    )
  elif metric_type == 'percent':
    line_plot = (
        alt.Chart(pd.DataFrame({'x': [.5]})).mark_rule().encode(y='x') +
        alt.Chart(pd.DataFrame({'y': [.5]})).mark_rule().encode(x='y')
    )
  # Add interative for concatenating due to https://github.com/altair-viz/altair/issues/2010.
  scatter = (plot + line_plot).properties(
      title=title,
      height=height,
      width=width,
  ).interactive()
  return scatter

def CreateMap(
    chart_df, fields_dict, title, scale_max, height, width, geo, metric_type):
  
  geo_field = 'state'
  geo_field_display_name = 'State'
  fips_code = 'state_fips_code'
  topo_feature = us_states
  if geo == 'county':
    geo_field = 'state_county'
    geo_field_display_name = 'County'
    fips_code = 'county_fips'
    topo_feature = us_counties

  if metric_type == 'ratio':
    scale_scheme = 'blueorange'
    scale_reverse = True
    scale_domain = [0, 2]
    legend_format = '.1f'
  elif metric_type == 'percent':
    scale_scheme = 'redyellowblue'
    scale_reverse = False
    scale_domain = [0, 1]
    legend_format = '.0%'

  highlight = alt.selection_single(on='mouseover', fields=['id', fips_code], empty='none')
  tooltips = [alt.Tooltip(geo_field + ':N', title=geo_field_display_name)]
  for field in ('y', 'x', 'percent'):
    tooltips.append(alt.Tooltip(
        fields_dict[field]['name'] + ':Q',
        format=fields_dict[field]['format'],
        title=fields_dict[field]['title'],
    ))

  field_names = [geo_field]
  field_names.extend([fields_dict[field]['name'] for field in fields_dict])
  plot = alt.Chart(topo_feature).mark_geoshape(
        stroke='white',
        strokeOpacity=.2,
        strokeWidth=1
    ).project(
      type='albersUsa'
    ).transform_lookup(
        lookup='id',
        from_=alt.LookupData(chart_df, fips_code, field_names)
    ).encode(
        alt.Color(fields_dict['percent']['name'],
                  type='quantitative',  
                  legend=alt.Legend(format=legend_format),
                  scale=alt.Scale(scheme=scale_scheme,
                                  reverse=scale_reverse,
                                  domain=scale_domain,
                                  clamp=True,
                                  ),
                  title=metric_type.capitalize()),
         tooltip=tooltips
    ).add_selection(
        highlight,
    )

  states_outline = alt.Chart(us_states).mark_geoshape(stroke='white', strokeWidth=1.5, fillOpacity=0, fill='white').project(
        type='albersUsa'
  )

  states_fill = alt.Chart(us_states).mark_geoshape(
        fill='silver',
        stroke='white'
  ).project('albersUsa')

  layered_map = alt.layer(states_fill, plot, states_outline).properties(
        height=height,
        width=width,
        title=title,
  )
  return layered_map

def CreateScatterPlotAndMap(
    chart_df, fields_dict, title, total_cases_scale_max, scatter_height, scatter_width, map_width, geo, metric_type):
  scatter = CreateScatterPlot(
    chart_df, fields_dict, title, total_cases_scale_max, scatter_height, scatter_width, geo, metric_type)
  map = CreateMap(
    chart_df, fields_dict, title, total_cases_scale_max, scatter_height, map_width, geo, metric_type)
  return (scatter | map).configure_view(
       strokeWidth=0,
   ).configure_mark(
       stroke='grey'
   ).configure_legend(
       gradientLength=scatter_height - 50
   )

def PrintSummaryStats(chart_df, field='percent'):
  below_15 = len(chart_df[chart_df[field] < .85]) / len(chart_df)
  above_15 = len(chart_df[chart_df[field] > 1.15]) / len(chart_df)
  print('between +/-15%: ', round(1 - below_15 - above_15, 2))
  below_50 = len(chart_df[chart_df[field] < .5]) / len(chart_df)
  above_50 = len(chart_df[chart_df[field] > 1.5]) / len(chart_df)
  print('between +/-50%: ', round(1 - below_50 - above_50, 2))
  print('< than .50: ', len(chart_df[chart_df[field] < .5]))
  print('> than 1.50: ', len(chart_df[chart_df[field] > 1.5]))
  print(chart_df[field].describe())
In [233]:
#@title
states_to_fips = {'AL': 1, 'AK': 2, 'AZ': 4, 'AR': 5, 'AS': 3, 'CA': 6, 'CO': 8, 'CT': 9, 'DC': 11, 'DE': 10, 'FL': 12, 'GA': 13, 'GU': 14, 'HI': 15, 'ID': 16, 'IL': 17, 'IN': 18, 'IA': 19, 'KS': 20, 'KY': 21, 'LA': 22, 'ME': 23, 'MD': 24, 'MA': 25, 'MI': 26, 'MN': 27, 'MS': 28, 'MO': 29, 'MT': 30, 'NE': 31, 'NV': 32, 'NH': 33, 'NJ': 34, 'NM': 35, 'NY': 36, 'NYC': 36, 'NC': 37, 'ND': 38, 'OH': 39, 'OK': 40, 'OR': 41, 'PA': 42, 'PR': 43, 'RI': 44, 'SC': 45, 'SD': 46, 'TN': 47, 'TX': 48, 'UT': 49, 'VT': 50, 'VA': 51, 'VI': 52, 'WA': 53, 'WV': 54, 'WI': 55, 'WY': 56, 'AS': 60, 'GU': 66, 'MP': 69, 'PR': 72, 'VI': 78}

crdt_df = pd.io.gbq.read_gbq(crdt_query, project_id=project_id)
crdt_df.set_index('state', inplace=True)

nyt_states_df = pd.io.gbq.read_gbq(nyt_states_query, project_id=project_id)
nyt_states_df.state_fips_code.unique()
nyt_territories = ('Puerto Rico', 'Guam', 'Virgin Islands', 'Northern Mariana Islands', 'American Samoa')
for territory in nyt_territories:
  nyt_states_df = nyt_states_df[nyt_states_df.state_name != territory]
nyt_states_df['state_fips_code'] = nyt_states_df.state_fips_code.astype(int)
nyt_states_df.set_index('state_fips_code', inplace=True)

crdt_df.reset_index(inplace=True)
crdt_df['state_fips_code'] = crdt_df.state
crdt_df = crdt_df.replace(to_replace={'state_fips_code': states_to_fips})
crdt_df.set_index('state_fips_code', inplace=True)
nyt_crdt_merged_df = nyt_states_df.join(crdt_df, on="state_fips_code", how='inner', lsuffix='_left', rsuffix='_right')

nyt_crdt_merged_df['percent'] = round(nyt_crdt_merged_df.nyt_cases / nyt_crdt_merged_df.crdt_cases, 2)
nyt_crdt_merged_df
nyt_crdt_merged_df.reset_index(inplace=True)


below_15 = len(nyt_crdt_merged_df[nyt_crdt_merged_df.percent < .85]) / len(nyt_crdt_merged_df)
above_15 = len(nyt_crdt_merged_df[nyt_crdt_merged_df.percent > 1.15]) / len(nyt_crdt_merged_df)
#print('between +/-15%: ', round(1 - below_15 - above_15, 2))
#nyt_crdt_merged_df.percent.describe()
In [234]:
#@title
nyt_crdt_fields_dict = {
    'x': {'name': 'crdt_cases', 'format': ',', 'title': 'CRDT cases'},
    'y': {'name': 'nyt_cases', 'format': ',', 'title': 'NYT cases'},
    'percent': {'name': 'percent', 'format': '.2f', 'title': 'Ratio of NYT to CRDT'},
}
nyt_crdt_title = 'Ratio of NYT to CRDT Cases by State as of %s' % date_display_name

CreateScatterPlotAndMap(
    nyt_crdt_merged_df, nyt_crdt_fields_dict, nyt_crdt_title, total_cases_scale_max, scatter_height, scatter_width, map_width, 'state', 'ratio'
).display()

The ratio of NYT to CRDT cases is between 0.97 and 1.11 for all states:

  • Average = 1.01
  • Median = 1.00
  • Min = 0.97 (Tennessee)
  • Max = 1.11 (Georgia)
  • Percent between 0.85 and 1.15 = 100% (51 states + D.C. within +/- 0.15)

States: CDC vs. CRDT

We can see below that the CDC case counts differ from the CRDT case counts much more drastically than the NYT did.

In [235]:
#@title
cdc_states_df = pd.io.gbq.read_gbq(cdc_states_query, project_id=project_id)
cdc_states_df.rename(columns={'res_state': 'state'}, inplace=True)
cdc_states_df.set_index('state', inplace=True)

crdt_df = pd.io.gbq.read_gbq(crdt_query, project_id=project_id)

for territory in territories:
  crdt_df = crdt_df[crdt_df.state != territory]

crdt_df.set_index('state', inplace=True)
cdc_crdt_merged_df = cdc_states_df.join(crdt_df, on="state", how='inner', lsuffix='_left', rsuffix='_right')
cdc_crdt_merged_df.reset_index(inplace=True)
cdc_crdt_merged_df['state_fips_code'] = cdc_crdt_merged_df.state
cdc_crdt_merged_df = cdc_crdt_merged_df.replace(to_replace={'state_fips_code': states_to_fips})
cdc_crdt_merged_df['percent'] = round(cdc_crdt_merged_df.cdc_cases / cdc_crdt_merged_df.crdt_cases, 4)

# PrintSummaryStats(cdc_crdt_merged_df)
In [236]:
#@title
cdc_crdt_fields_dict = {
    'x': {'name': 'crdt_cases', 'format': ',', 'title': 'CRDT cases'},
    'y': {'name': 'cdc_cases', 'format': ',', 'title': 'CDC cases'},
    'percent': {'name': 'percent', 'format': '.2f', 'title': 'Ratio of CDC to CRDT'},
}
cdc_crdt_title = 'Ratio of CDC to CRDT Cases by State as of %s' % date_display_name

CreateScatterPlotAndMap(
    cdc_crdt_merged_df, cdc_crdt_fields_dict, cdc_crdt_title, total_cases_scale_max, scatter_height, scatter_width, map_width, 'state', 'ratio'
).display()

Texas alone is missing 1.5M cases compared to the total case counts in the CRDT data.

The ratio of CDC to CRDT cases is between 0.03 and 1.64 for all states + D.C.:

  • Average = 0.79
  • Median = 0.97
  • Min = 0.03 (Wyoming)
  • Max = 1.64 (Alaska)
  • Percent between 0.85 and 1.15 = 63% (32 states within +/- 0.15)
  • Percent between 0.50 and 1.50 = 71% (36 states within +/- 0.50)

The 32 states that were within +/-15% of the CRDT data could plausibly be off due to time lags in reporting cases to the CDC vs. reporting them on state public health websites, but there are many outlier states that are too far off from the CRDT case counts to be explained by a time lag:

  • 14 states: < 0.50 ratio of CDC to CRDT cases (Texas, Wyoming, Louisiana, and West Virginia < 0.10)
  • 1 state > 1.50 ratio of CDC to CRDT cases (Alaska)

Counties: CDC vs. NYT

In [237]:
#@title
# CDC vs. NYT county

df = pd.io.gbq.read_gbq(cdc_counties_query, project_id=project_id)
for territory in territories:
  df = df[df.res_state != territory]

project_id = 'msm-secure-data-1b'
df_county_fips_map = pd.io.gbq.read_gbq(f'''
SELECT
*
FROM
  `msm-secure-data-1b.ndunlap_secure.county_fips_mapping`
''', project_id=project_id)

df_county_fips_map.cdc_county = df_county_fips_map.cdc_county.str.lower()
df_county_fips_map['state_county'] = df_county_fips_map.state + '-' + df_county_fips_map.cdc_county
df_county_fips_map['state_county'] = df_county_fips_map.state_county.astype('string').str.strip()
df_county_fips_map.set_index('state_county', inplace=True)
In [238]:
#@title
# Concatenate the state and county names because county names are not unique across states.
df.res_county = df.res_county.str.lower()
df['state_county'] = df.res_state + '-' + df.res_county
df['state_county'] = df.state_county.astype('string').str.strip()
df.set_index('state_county', inplace=True)
df['race_ethnicity_combined'] = df.race_ethnicity_combined.astype('string').str.strip()

race_ethnicity_combined_map = {
    'Asian, Non-Hispanic': 'asian_cases',
    'Black, Non-Hispanic': 'black_cases',
    'White, Non-Hispanic': 'white_cases',
    'American Indian/Alaska Native, Non-Hispanic': 'aian_cases',
    'Hispanic/Latino': 'hispanic_cases',
    'Multiple/Other, Non-Hispanic': 'other_cases',
    'Native Hawaiian/Other Pacific Islander, Non-Hispanic': 'nhpi_cases',
    'Missing': 'unknown_cases',
    'Unknown': 'unknown_cases',
    'NA': 'na_cases',
}
df = df.replace(to_replace={'race_ethnicity_combined': race_ethnicity_combined_map})
In [239]:
#@title
mismatches_df = df.join(df_county_fips_map, on="state_county", how='outer', lsuffix='_left', rsuffix='_right')
mismatches_df = mismatches_df[mismatches_df.county_fips.isna()]
mismatches_df = mismatches_df[mismatches_df.res_state != 'NA']
mismatches_df = mismatches_df[mismatches_df.res_state != 'Unknown']
mismatches_df = mismatches_df[mismatches_df.res_county != 'na']
mismatches_df = mismatches_df[mismatches_df.res_county != 'unknown']
#print(mismatches_df.cases.sum())
#print('vs. 60363 with NULL county_fips_code')
# SELECT 
#count(*) as total_cases,
#FROM `msm-secure-data-1b.ndunlap_secure.cdc_restricted_access_20201231`
#WHERE county_fips_code IS NULL
In [240]:
#@title
merged_df = df.join(df_county_fips_map, on="state_county", how='inner', lsuffix='_left', rsuffix='_right')

# Create a crosstab table with rows = counties, columns = race_ethnicity_combined.
crosstab_df = pd.crosstab(merged_df['county_fips'], merged_df.race_ethnicity_combined, values=merged_df.cases, aggfunc=sum,
                          margins=True,
                          margins_name='total_cases'
)
# Have to reset_index() to go from pandas multi-index to single index.
crosstab_df = crosstab_df.reset_index()
crosstab_df.drop(axis=0, index=len(crosstab_df) - 1, inplace=True)
crosstab_df['county_fips'] = crosstab_df.county_fips.astype(int)
crosstab_df['total_known_cases'] = crosstab_df['total_cases'] - crosstab_df.unknown_cases.fillna(0)
crosstab_df['total_known_cases'] = crosstab_df['total_cases'] - crosstab_df.na_cases.fillna(0) - crosstab_df.unknown_cases.fillna(0)
In [241]:
#@title
# Get the display names for each county.
# Use ACS data that only has one FIPS code per county unlike the fips_county_map.
df_acs_name_lookup = pd.io.gbq.read_gbq(f'''
SELECT
  *
FROM
  `msm-internal-data.ipums_acs.acs_2019_5year_county`
''', project_id=project_id)

df_acs_name_lookup['state_county'] = df_acs_name_lookup.county.astype('string').str.strip() + ', ' + df_acs_name_lookup.state.astype('string').str.strip()
df_acs_name_lookup.drop(columns=['state', 'county'], inplace=True)
df_acs_name_lookup.set_index('county_fips', inplace=True)

county_chart_df = crosstab_df.join(df_acs_name_lookup, on="county_fips", how='inner', lsuffix='_left', rsuffix='_right')
county_chart_df.county_fips = county_chart_df.county_fips.astype(int)

#print(len(county_chart_df))
#print(county_chart_df.total_pop.sum())
#print(county_chart_df.total_pop.sum() / 324697795)  # Population covered in these counties
#print(0.55 * 324697795) # NYT population
In [242]:
#@title

nyt_counties_df = pd.io.gbq.read_gbq(nyt_counties_query, project_id=project_id)
nyt_counties_df.rename(columns={'county_fips_code': 'county_fips'}, inplace=True)
nyt_counties_df.county_fips.unique()
nyt_counties_df['county_fips'] = nyt_counties_df.county_fips.astype(int)
nyt_counties_df.set_index('county_fips', inplace=True)

county_chart_df.set_index('county_fips', inplace=True)
nyt_merged_df = county_chart_df.join(nyt_counties_df, on="county_fips", how='inner', lsuffix='_left', rsuffix='_right')
nyt_merged_df = nyt_merged_df.reset_index()
nyt_merged_df['percent'] = round(nyt_merged_df.total_cases / nyt_merged_df.nyt_cases, 2)

#PrintSummaryStats(nyt_merged_df)

We can do the same analysis at the county level using the CDC vs. NYT data.

Each dot is a county (hover to see details). We show all 3,046 available counties on the left and zoom in on the smaller counties on the right.

In [290]:
#@title
cdc_nyt_fields_dict = {
    'x': {'name': 'nyt_cases', 'format': ',', 'title': 'NYT cases'},
    'y': {'name': 'total_cases', 'format': ',', 'title': 'CDC cases'},
    'percent': {'name': 'percent', 'format': '.2f', 'title': 'Ratio of CDC to NYT'},
}
cdc_nyt_title = 'Ratio of CDC Cases to NYT Cases by County as of Dec 16'
zoom_cdc_nyt_title = 'Zoom in on counties with up to 100,000 Cases'

scatter = CreateScatterPlot(
    nyt_merged_df, cdc_nyt_fields_dict, cdc_nyt_title, 700000, scatter_height, scatter_width, 'county', 'ratio'
)
zoom_scatter = CreateScatterPlot(
    nyt_merged_df, cdc_nyt_fields_dict, zoom_cdc_nyt_title, 100000, scatter_height, scatter_width, 'county', 'ratio'
)

(scatter | zoom_scatter).configure_view(
    strokeWidth=0,
).configure_legend(
    gradientLength=map_height - 50
).configure_mark(
    stroke='grey'
).display()

Harris County, Texas is missing 211K cases compared to the total case counts in the NYT data.

The ratio of CDC to NYT cases is between 0.00 and 9.80 for the 3,046 counties in the CDC data:

  • Average = 0.73
  • Median = 0.87
  • Min = 0.00
  • Max = 9.80 (Lake and Peninsula Borough, Alaska)
  • Percent between 0.85 and 1.15 = 48% (1,463 counties within +/- 0.15)
  • Percent between 0.50 and 1.50 = 70% (2,139 counties within +/- 0.50)

We can also view these ratios on the map on the right and compare them to the state-level totals map from the previous section on the left.

In [291]:
#@title
cdc_nyt_fields_dict = {
    'x': {'name': 'nyt_cases', 'format': ',', 'title': 'NYT cases'},
    'y': {'name': 'total_cases', 'format': ',', 'title': 'CDC cases'},
    'percent': {'name': 'percent', 'format': '.2f', 'title': 'Ratio of CDC to NYT'},
}
cdc_nyt_title = 'Ratio of CDC Cases to NYT Cases by County as of Dec 16'

cdc_nyt_map = CreateMap(
    nyt_merged_df, cdc_nyt_fields_dict, cdc_nyt_title, total_cases_scale_max, map_height, map_width, 'county', 'ratio'
)
cdc_crdt_map = CreateMap(
    cdc_crdt_merged_df, cdc_crdt_fields_dict, cdc_crdt_title, total_cases_scale_max, map_height, map_width, 'state', 'ratio'
)

(cdc_crdt_map | cdc_nyt_map).configure_view(
    strokeWidth=0,
).configure_legend(
    gradientLength=map_height - 50
).display()

Notes:

  • The legend only goes to 2.0, and all counties with a larger ratio are shown in the same dark blue color.
  • A larger version of the county map for hovering over smaller counties is available in the Appendix.

We can see that the ratio of the CDC case data to CRDT/NYT aggregate data is highly variable across the U.S., but there is less variability across the counties within each state. This pattern indicates that the data completeness issues may be due to state-level policies or data collection processes rather than at the county level. We can also see that some counties are missing entirely from the data; e.g., in Texas, Wyoming, West Virginia, and Nebraska. It's possible some of these counties have cases in the data but the county name was suppressed for privacy reasons due to small population sizes. Even so, those cases would still have a state name, so they would be captured in the map on the left above.

Cases with Race/Ethnicity

States and Counties: CDC

In [245]:
#@title
states_df = pd.io.gbq.read_gbq(compare_cases_unknowns_query, project_id=project_id)
for state in ('Unknown', 'NA', 'OCONUS'):
  states_df = states_df[states_df.res_state != state]

states_df['race_ethnicity_combined'] = states_df.race_ethnicity_combined.astype('string').str.strip()
states_df = states_df.replace(to_replace={'race_ethnicity_combined': {
    'Asian, Non-Hispanic': 'cdc_known_cases',
    'Black, Non-Hispanic': 'cdc_known_cases',
    'White, Non-Hispanic': 'cdc_known_cases',
    'American Indian/Alaska Native, Non-Hispanic': 'cdc_known_cases',
    'Hispanic/Latino': 'cdc_known_cases',
    'Multiple/Other, Non-Hispanic': 'cdc_known_cases',
    'Native Hawaiian/Other Pacific Islander, Non-Hispanic': 'cdc_known_cases',
    'Missing': 'cdc_unknown_cases',
    'Unknown': 'cdc_unknown_cases',
    'NA': 'cdc_na_cases',
    }})
states_df.rename(columns={'res_state': 'state'}, inplace=True)
In [246]:
#@title
crosstab_df = pd.crosstab(states_df['state'], states_df.race_ethnicity_combined, values=states_df.cdc_cases, aggfunc=sum,
                          margins=True,
                          margins_name='cdc_cases'
)
# Have to reset_index() to go from pandas multi-index to single index.
crosstab_df = crosstab_df.reset_index()
crosstab_df.drop(axis=0, index=len(crosstab_df) - 1, inplace=True)
crosstab_df['cdc_known_or_na_cases'] = crosstab_df['cdc_cases'] - crosstab_df.cdc_unknown_cases.fillna(0)
crosstab_df['cdc_known_cases'] = crosstab_df['cdc_cases'] - crosstab_df.cdc_na_cases.fillna(0) - crosstab_df.cdc_unknown_cases.fillna(0)
crosstab_df

crdt_merged_df = crosstab_df.join(crdt_df, on="state", how='inner', lsuffix='_left', rsuffix='_right')
crdt_merged_df.reset_index(inplace=True)
crdt_merged_df['state_fips_code'] = crdt_merged_df.state
crdt_merged_df = crdt_merged_df.replace(to_replace={'state_fips_code': states_to_fips})
crdt_merged_df['cdc_known_cases_percent'] = round(crdt_merged_df.cdc_known_cases / crdt_merged_df.cdc_cases, 4)
crdt_merged_df['cdc_known_or_na_cases_percent'] = round(crdt_merged_df.cdc_known_or_na_cases / crdt_merged_df.cdc_cases, 4)
crdt_merged_df['percent'] = round(crdt_merged_df.cdc_cases / crdt_merged_df.crdt_cases, 4)
crdt_merged_df['percent_known_cases'] = round(crdt_merged_df.cdc_known_cases / crdt_merged_df.crdt_known_race_cases, 4)

crdt_merged_df_no_ny = crdt_merged_df[crdt_merged_df.state != 'NY']
#PrintSummaryStats(crdt_merged_df_no_ny)

When evaluating the percent of cases that report on race/ethnicity in the CDC data, we also need to consider the 2% of overall cases with race/ethnicity that were suppressed due to privacy reasons. We should give states and counties credit for reporting race/ethnicity data for those cases even if we aren't able to use it due to privacy suppression. Below, the maps on the top left shows the percent of cases with known race/ethnicity and the map on the top right shows the percent of cases with known or suppressed race/ethnicity. The maps on the bottom show the same information at the county level.

In [300]:
#@title

chart_df = county_chart_df.copy(deep=True)
chart_df.reset_index(inplace=True)
chart_df.county_fips = chart_df.county_fips.astype(int)
chart_df['percent_known_cases'] = round(chart_df.total_known_cases / chart_df.total_cases, 2)
chart_df['total_known_or_na_cases'] = chart_df.total_known_cases.fillna(0) + chart_df.na_cases.fillna(0)
chart_df['percent_known_or_na_cases'] = round(chart_df.total_known_or_na_cases / chart_df.total_cases, 2)
In [301]:
#@title
cdc_known_state_fields_dict = {
    'x': {'name': 'cdc_known_cases', 'format': ',', 'title': 'Known race/ethnicity cases'},
    'y': {'name': 'cdc_cases', 'format': ',', 'title': 'CDC cases'},
    'percent': {'name': 'cdc_known_cases_percent', 'format': '.0%', 'title': 'Percent known cases'},
}

cdc_known_state_title = 'CDC Cases with Known Race/Ethnicity as of %s' % date_display_name
cdc_known_state_map = CreateMap(
    crdt_merged_df, cdc_known_state_fields_dict, cdc_known_state_title, total_cases_scale_max, map_height, map_width, 'state', 'percent'
)

cdc_known_or_na_state_fields_dict = {
    'x': {'name': 'cdc_known_or_na_cases', 'format': ',', 'title': 'Known or suppressed race/ethnicity cases'},
    'y': {'name': 'cdc_cases', 'format': ',', 'title': 'CDC cases'},
    'percent': {'name': 'cdc_known_or_na_cases_percent', 'format': '.0%', 'title': 'Percent known or suppressed cases'},
}
cdc_known_or_na_state_title = 'CDC Cases with Known+Suppressed Race/Ethnicity as of %s' % date_display_name
cdc_known_or_na_state_map = CreateMap(
    crdt_merged_df, cdc_known_or_na_state_fields_dict, cdc_known_or_na_state_title, total_cases_scale_max, map_height, map_width, 'state', 'percent'
)

(cdc_known_state_map | cdc_known_or_na_state_map).configure(
    padding={"left": 0, "top": 5, "right": 0, "bottom": 5}
).configure_view(
    strokeWidth=0,
).configure_legend(
    gradientLength=map_height - 50
).display()
In [302]:
#@title
cdc_known_county_fields_dict = {
    'x': {'name': 'total_known_cases', 'format': ',', 'title': 'Known race/ethnicity cases'},
    'y': {'name': 'total_cases', 'format': ',', 'title': 'CDC cases'},
    'percent': {'name': 'percent_known_cases', 'format': '.0%', 'title': 'Percent known cases'},
}
cdc_known_county_title = 'CDC Cases with Known Race/Ethnicity as of %s' % date_display_name
cdc_known_county_map = CreateMap(
    chart_df, cdc_known_county_fields_dict, cdc_known_county_title, total_cases_scale_max, map_height, map_width, 'county', 'percent'
)

cdc_known_or_na_county_fields_dict = {
    'x': {'name': 'total_known_or_na_cases', 'format': ',', 'title': 'Known or suppressed race/ethnicity cases'},
    'y': {'name': 'total_cases', 'format': ',', 'title': 'CDC cases'},
    'percent': {'name': 'percent_known_or_na_cases', 'format': '.0%', 'title': 'Percent known or suppressed cases'},
}
cdc_known_or_na_county_title = 'CDC Cases with Known+Suppressed Race/Ethnicity as of %s' % date_display_name
cdc_known_or_na_county_map = CreateMap(
    chart_df, cdc_known_or_na_county_fields_dict, cdc_known_or_na_county_title, total_cases_scale_max, map_height, map_width, 'county', 'percent'
)

(cdc_known_county_map | cdc_known_or_na_county_map).configure(
    padding={"left": 0, "top": 5, "right": 0, "bottom": 5}
).configure_view(
    strokeWidth=0,
).configure_legend(
    gradientLength=map_height - 50
).display()
In [250]:
#@title
#PrintSummaryStats(crdt_merged_df, field='cdc_known_cases_percent')
#PrintSummaryStats(crdt_merged_df, field='cdc_known_or_na_cases_percent')
#tuple(crdt_merged_df[crdt_merged_df.cdc_known_or_na_cases_percent <= .5].state)

Note: A larger version of the county maps for hovering over smaller counties is available in the Appendix.

We can see that the maps on the right are bluer than those on the left, which means that states and counties are doing a better job at reporting race/ethnicity when we consider the data that was suppressed for privacy reasons. This effect is most pronounced in states like Wyoming, Texas, and Louisiana, which have many counties with small populations or small population subgroups.

We can see the increase in the percent of cases with known race/ethnicity --> known or suppressed suppressed across all states:

  • Average = 59% --> 65%
  • Median = 65% --> 69%
  • Min = 3% --> 13%
  • Max = 88% --> 96%
  • Percent above 85% = 4% --> 12% (2 --> 6 states)
  • Percent above 50% = 75% --> 78% (38 --> 40 states)

However, even if you include the cases with suppressed race/ethnicity, California alone is still missing race/ethnicity data for 1.5M cases. Los Angeles County alone is responsible for over half a million - 508K - of the cases in California missing race/ethnicity data.

States: CDC vs. CRDT

How does the CDC data compare to the CRDT data, which is the most up-to-date aggregate data we have for race/ethnicity at the state level?

In the comparison below, we look at the number of people in the CRDT with known race within each state. If a state uses a combined race/ethnicity field, then it's a straightforward comparison to the CDC's combined race/ethnicity field. If a state uses separate fields for race/ethnicity, then we still use the number of people with known race within each state because all of the race categories will also contain Hispanic/Latino people. We could potentially be undercounting the number of people with known race/ethnicity in the CRDT is if there are people who have unknown race but known ethnicity. If we adjusted the numbers in those cases, it would make the CDC numbers look even worse in comparison to the CRDT than they already do, so we can think of the comparison below as a best-case scenario for the CDC data.

In [293]:
#@title
fields_dict = {
    'x': {'name': 'crdt_known_race_cases', 'format': ',', 'title': 'CRDT known race/ethnicity cases'},
    'y': {'name': 'cdc_known_cases', 'format': ',', 'title': 'CDC known race/ethnicity cases'},
    'percent': {'name': 'percent_known_cases', 'format': '.2f', 'title': 'Ratio of CDC to CRDT'},
}
title = 'Ratio of CDC to CRDT Cases with Known Race/Ethnicity as of %s' % date_display_name

CreateScatterPlotAndMap(
    crdt_merged_df, fields_dict, title, 1200000, scatter_height, scatter_width, map_width - 5, 'state', 'ratio'
).display()

Notes:

  • New York is marked as gray in the map because it has 0 cases with known race/ethnicity in CRDT vs. 345K cases in the CDC data.

The ratio of CDC to CRDT cases with known race/ethnicity is between 0.01 and 1.18 for all states excluding New York:

  • Average = 0.63
  • Median = 0.77
  • Min = 0.01 (North Dakota, Louisiana, Wyoming)
  • Max = 1.18 (Massachusetts)
  • Percent between 0.85 and 1.15 = 34% (17 states within +/- 0.15)
  • Percent between 0.50 and 1.50 = 64% (32 states within +/- 0.50)

CRDT has race/ethnicity data for 800K people more people in California and 533K more people in Florida than the CDC data has.

Only 4 states (Massachusetts, Minnesota, Utah, Washington) had more cases with known race/ethnicity in the CDC data than in the CRDT data, whereas 23 states had more total cases in the CDC data than in the CRDT data in the section above.

We can also look at the CRDT's percentage of cases with known race/ethnicity on its own, similar to charts for the CDC data in the previous section. Overall, 66% of the cases in the CRDT data have known race/ethnicity compared to 55% in the CDC data (57% with suppressed data).

In [286]:
#@title
crdt_known_state_fields_dict = {
    'x': {'name': 'crdt_known_race_cases', 'format': ',', 'title': 'Known race/ethnicity cases'},
    'y': {'name': 'crdt_cases', 'format': ',', 'title': 'CDC cases'},
    'percent': {'name': 'crdt_known_race_cases_percent', 'format': '.0%', 'title': 'Percent known cases'},
}

crdt_known_state_title = 'CRDT Cases with Known Race/Ethnicity as of %s' % date_display_name
crdt_known_map = CreateMap(
    cdc_crdt_merged_df, crdt_known_state_fields_dict, crdt_known_state_title, total_cases_scale_max, map_height, map_width, 'state', 'percent'
)

(crdt_known_map | cdc_known_state_map).configure(
    padding={"left": 0, "top": 5, "right": 0, "bottom": 5}
).configure_view(
    strokeWidth=0,
).configure_legend(
    gradientLength=map_height - 50
).display()
In [253]:
#@title
#print(crdt_merged_df.crdt_known_race_cases.sum() / crdt_merged_df.crdt_cases.sum())
#PrintSummaryStats(cdc_crdt_merged_df, field='crdt_known_race_cases_percent')

The percent of CRDT cases with known race/ethnicity is between 0% and 99% for all states:

  • Average = 73%
  • Median = 76%
  • Min = 0% (New York)
  • Max = 99% (District of Columbia)
  • Percent above 85% = 24% (12 states)
  • Percent above 50% = 96% (49 states)

CRDT is a more complete source for race/ethnicity data at the state level than the CDC data in terms of both the counts of cases with race/ethnicity data and the percentage of cases with race/ethnicity data. The only exceptions to this are New York, which has no cases with race/ethnicity, and possibly Massachusetts, which has 1.18 times as many cases with race/ethnicity than the CRDT.

How to Improve State and County Data

How can states and counties improve their data completeness for race/ethnicity data, especially when compared to the more reliable and up-to-date aggregate data that comes from public health websites, as collected by the CRDT and NYT?

There are two ways in which states can improve the data they send to the CDC:

  1. Increase the total cases reported to get closer to the aggregate data.
  2. Increase the percentage of cases reported with known race/ethnicity to get closer to 100%.

In the Total Case Counts section above, we identified the states and counties with the biggest discrepancies relative to aggregate data. In the Cases with Race/Ethnicity section, we looked at the percentage of cases within each state and county that have race/ethnicity data.

The charts below show those two components together; the scatterplots show (1) the discrepancy vs. CRDT/NYT total case counts on the y-axis, and (2) the percentage of cases with known or suppressed race/ethnicity on the x-axis. The colors of the dots and on the map show the product of those two numbers, which is the percentage of CRDT/NYT total case counts accounted for in the CDC data with race/ethnicity. This is a composite measure of the percentage of total cases are included in the CDC data with known or suppressed race/ethnicity.

The scatterplots below can help us diagnose the issues in each state or county:

  • Bottom left quadrant: Low percentage of cases reported, low reporting of race/ethnicity.
  • Top left quadrant: Mid-to-high percentage of cases reported, low reporting of race/ethnicity.
  • Bottom right quadrant: Low percentage of cases reported, mid-to-high reporting of race/ethnicity.
  • Top right quadrant: Mid-to-high percentage of cases reported, mid-to-high reporting of race/ethnicity.
In [303]:
#@title
nyt_cdc_known_merged_df = chart_df.join(nyt_counties_df, on="county_fips", how='inner', lsuffix='_left', rsuffix='_right')
nyt_cdc_known_merged_df.reset_index(inplace=True)
nyt_cdc_known_merged_df['percent'] = round(nyt_cdc_known_merged_df.total_cases / nyt_cdc_known_merged_df.nyt_cases, 2)
In [304]:
#@title
crdt_merged_df['percent_max_100'] = crdt_merged_df.percent.clip(upper=1)
crdt_merged_df['percent_reccs'] = crdt_merged_df.percent_max_100 * crdt_merged_df.cdc_known_or_na_cases_percent
state_reccs_fields_dict = {
    'y': {'name': 'percent_max_100', 'format': '.0%', 'title': 'CDC percent of CRDT total cases'},
    'x': {'name': 'cdc_known_or_na_cases_percent', 'format': '.0%', 'title': 'CDC percent with known or suppressed race/ethnicity'},
    'percent': {'name': 'percent_reccs', 'format': '.0%', 'title': 'Product: CDC percent of CRDT total with race/ethnicity'},
}
state_reccs_title = 'State Completeness: Total Cases x Race/Ethnicity'

scatter = CreateScatterPlotAndMap(
    crdt_merged_df, state_reccs_fields_dict, state_reccs_title, 1, scatter_height, scatter_width, map_width, 'state', 'percent'
)
scatter.configure_view(
    strokeWidth=0,
).configure_legend(
    gradientLength=map_height - 50
).configure_mark(
    stroke='grey'
).display()
In [305]:
#@title
nyt_cdc_known_merged_df['percent_max_100'] = nyt_cdc_known_merged_df.percent.clip(upper=1)
nyt_cdc_known_merged_df['percent_reccs'] = nyt_cdc_known_merged_df.percent_max_100 * nyt_cdc_known_merged_df.percent_known_or_na_cases
county_reccs_fields_dict = {
    'y': {'name': 'percent_max_100', 'format': '.0%', 'title': 'CDC percent of NYT total cases'},
    'x': {'name': 'percent_known_or_na_cases', 'format': '.0%', 'title': 'CDC percent with known or suppressed race/ethnicity'},
    'percent': {'name': 'percent_reccs', 'format': '.0%', 'title': 'Product: CDC percent of NYT total with race/ethnicity'},
}
county_reccs_title = state_reccs_title = 'County Completeness: Total Cases x Race/Ethnicity'


scatter = CreateScatterPlotAndMap(
    nyt_cdc_known_merged_df, county_reccs_fields_dict, county_reccs_title, 1, scatter_height, scatter_width, map_width, 'county', 'percent'
)
scatter.configure_view(
    strokeWidth=0,
).configure_legend(
    gradientLength=map_height - 50
).configure_mark(
    stroke='grey'
).display()

Notes:

  • All states or counties with > 100% of the total cases in the CRDT or NYT data were capped at 100%.
  • A larger version of the county map for hovering over smaller counties is available in the Appendix.

For states and counties to improve the percentage of cases with known race/ethnicity, most need to reduce their number of Unknowns or missing data. A few states, however, seem to have errors where some race/ethnicity groups are almost entirely missing from the data:

  • California (22% known or suppressed): 0% of the cases are Hispanic/Latino (96 cases out of 1.9M).
  • North Dakota (13% known or suppressed): The only races reported are Asian and American Indian / Alaska Native.
  • Delaware (13% known or suppressed): White and Black cases are both 0% of total (10 cases each out of 46K).

Another way to evaluate states with issues is to see if they are improving over time with the percentage of cases with known race/ethnicity. We can look at the 11 states in the left side of the states scatterplot above that have < 50% of cases with known or suppressed race/ethnicity.

In [257]:
#@title
compare_cases_unknowns_query = ('''
SELECT
  res_state,
  race_ethnicity_combined,
  COUNT(*) as cdc_cases
FROM
  %s
WHERE res_state = 'RI'
GROUP BY
   res_state,
   race_ethnicity_combined
''' % cdc_table)
state_df = pd.io.gbq.read_gbq(compare_cases_unknowns_query, project_id=project_id)
state_df['percent'] = round(state_df.cdc_cases / state_df.cdc_cases.sum(), 2)
In [258]:
#@title
cdc_states_by_month_query = ('''
SELECT
  res_state,
  CONCAT(EXTRACT(YEAR from cdc_case_earliest_dt), '-Q', EXTRACT(QUARTER from cdc_case_earliest_dt)) as date,
  COUNT(*) as total_cases,
FROM
  %s
WHERE
  cdc_case_earliest_dt >= DATE(2020, 1, 1) AND
  cdc_case_earliest_dt < DATE(2020, 12, 1) AND
  res_state in ('AK', 'CA', 'CT', 'DE', 'GA', 'LA', 'MD', 'ND', 'NY', 'PA', 'RI')
GROUP BY
   1, 2
ORDER BY
   1, 2
''' % cdc_table)

cdc_states_by_month_known_or_na_query = ('''
SELECT
  res_state,
  CONCAT(EXTRACT(YEAR from cdc_case_earliest_dt), '-Q', EXTRACT(QUARTER from cdc_case_earliest_dt)) as date,
  COUNT(*) as known_or_na_cases,
FROM
  %s
WHERE
  cdc_case_earliest_dt >= DATE(2020, 1, 1) AND
  cdc_case_earliest_dt < DATE(2020, 12, 1) AND
  race_ethnicity_combined != 'Unknown' AND
  race_ethnicity_combined != 'Missing'
GROUP BY
   1, 2
ORDER BY
   1, 2
''' % cdc_table)

cdc_states_by_month_df = pd.io.gbq.read_gbq(cdc_states_by_month_query, project_id=project_id)
cdc_states_by_month_df.set_index(keys=['res_state', 'date'], inplace=True)

cdc_states_by_month_known_or_na_df = pd.io.gbq.read_gbq(cdc_states_by_month_known_or_na_query, project_id=project_id)
cdc_states_by_month_known_or_na_df.set_index(keys=['res_state', 'date'], inplace=True)

cdc_known_over_time = cdc_states_by_month_df.join(cdc_states_by_month_known_or_na_df, how='left')
cdc_known_over_time['percent_known_or_na'] = round(cdc_known_over_time.known_or_na_cases / cdc_known_over_time.total_cases, 2)
cdc_known_over_time.reset_index(inplace=True)
In [259]:
#@title
base = alt.Chart(cdc_known_over_time).mark_line(point=True).encode(
    x=alt.X('date', title='CDC earliest report date', axis=alt.Axis(labelAngle=0)),
    y=alt.Y('percent_known_or_na', title='Percent unknown or suppressed race/ethnicity', axis=alt.Axis(format='%')),
    color=alt.Color('res_state', scale=alt.Scale(scheme='category20'), title='State')
).properties(
    title='States with less than 50% of Cumulative Cases with Known or Suppressed Race/Ethnicity',
    height=map_height,
    width=map_width
).display()

We can see that a few of these states have improved over time and now have more than 50% of cases with known or suppressed race/ethnicity: Louisiana, New York, and Georgia. A few states started off the year with greater than 50%: Alaska, Connecticut, and Maryland. But none of these states have fully fixed their issues with the possible exception of Louisiana if it continues on its current trajectory.

Appendix

Additional CDC Data Fields

The additional fields in the data, including whether the person died or was hospitalized, are all known for fewer than 50% of cases.

In [260]:
#@title
field_list = ['death_yn', 'hosp_yn', 'icu_yn', 'onset_dt', 'pos_spec_dt', 'hc_work_yn',
              'pna_yn', 'abxchest_yn', 'acuterespdistress_yn', 'mechvent_yn', 'fever_yn', 'sfever_yn', 'chills_yn', 'myalgia_yn', 'runnose_yn',
              'sthroat_yn', 'cough_yn', 'sob_yn', 'nauseavomit_yn', 'headache_yn', 'abdom_yn', 'diarrhea_yn', 'medcond_yn']
project_id = 'msm-secure-data-1b'
table = '`msm-secure-data-1b.ndunlap_secure.cdc_restricted_access_20201231`'
FieldAnalysis(project_id, table, field_list).display()

The CDC also commented on these fields in their case data FAQs:

Because of the volume of cases, most health departments are unable to conduct investigations of every case to obtain additional information. Because of this, most case reports are missing data on patient demographics, symptoms, underlying health conditions, characteristics of hospitalizations such as ventilator use, and other factors such as recent travel history.

The case report form contains many more fields, but unfortunately, the data gets less complete as you go down the form. Citizens for Responsibility and Ethics in Washington (CREW) obtained a version of this data via FOIA that contained 101 fields with data up to Aug 25, 2020 and shared it with MSM/SHLI. Several of the additional fields from that dataset are shown below; the field with the most known data is whether the case was associated with an outbreak, but even that is only known for 30% of cases.

In [261]:
#@title
field_list = ['death_week', 'icu_length', 'hosp_length', 'translator_yn', 'housing', 'exp_work_critical', 'outbreak_associated',
              'rigors_yn', 'taste_yn', 'fatigue_yn', 'wheezing_yn', 'diffbreathing_yn', 'chestpain_yn', 'test_pcr', 'test_serologic',
              'exp_adultfacility', 'exp_airport', 'exp_animal', 'exp_community', 'exp_gathering', 'exp_contact', 'exp_correctional',
              'exp_ship', 'exp_house', 'exp_other', 'exp_school', 'exp_othcountry', 'exp_unk', 'exp_work']
project_id = 'msm-internal-data'
table = '`msm-internal-data.crew.covid_case_surveillance`'
FieldAnalysis(project_id, table, field_list).display()

Large county maps

To make it easier to hover over small counties, here are larger versions of the county maps that appeared in this report.

In [295]:
#@title
cdc_nyt_map = CreateMap(
    nyt_merged_df, cdc_nyt_fields_dict, cdc_nyt_title, total_cases_scale_max, map_height * 2, map_width * 2, 'county', 'ratio'
).configure_view(
    strokeWidth=0,
).configure_legend(
    gradientLength=map_height - 50
)
cdc_nyt_map.display()
In [296]:
#@title
cdc_known_county_map = CreateMap(
    chart_df, cdc_known_county_fields_dict, cdc_known_county_title, total_cases_scale_max, map_height * 2, map_width * 2, 'county', 'percent'
).configure_view(
    strokeWidth=0,
).configure_legend(
    gradientLength=map_height - 50
)

cdc_known_county_map.display()
In [306]:
#@title
cdc_known_or_na_county_map = CreateMap(
    chart_df, cdc_known_or_na_county_fields_dict, cdc_known_or_na_county_title, total_cases_scale_max, map_height * 2, map_width * 2, 'county', 'percent'
).configure_view(
    strokeWidth=0,
).configure_legend(
    gradientLength=map_height - 50
)
cdc_known_or_na_county_map.display()
In [307]:
#@title
county_completeness = CreateMap(
    nyt_cdc_known_merged_df, county_reccs_fields_dict, county_reccs_title, 1, map_height * 2, map_width * 2, 'county', 'percent'
)
county_completeness.configure_view(
    strokeWidth=0,
).configure_legend(
    gradientLength=map_height - 50
).configure_mark(
    stroke='grey'
).display()
In [265]:
 

Geographic notes

The CDC data includes a county_fips_code field with a unique identifier for each county. However, we ended up using a different lookup for state/county to county FIPS codes due to some data quality issues. When we used the county_fips_code field provided in the CDC data, 60K records had no county_fips_code, including all cases in D.C., Long Beach, CA, Pasadena, CA, and Alameda, CA as well as several other locations. We created a lookup using American Community Survey (ACS) data and then modified the lookup to handle cases of misspellings and other issues in the CDC data. We documented the changes to the ACS mapping and included the new mapping in this spreadsheet.

With the new mapping, we now match all but 12K cases to county FIPS codes, and many of those have NA or Unknown for their county or state. We also identified 63 non-existent state-county combinations listed here that the CDC file was matching to county_fips_codes for ~700 cases. We no longer match them to any county_fips_codes, but we do report them in the state-level data matching that state.

One geographical exception worth noting is that two areas in Alaska reported their numbers in a combined way:

  • Hoonah-angoon And Yakutat Combined: 43 cases
  • Bristol Bay And Lake And Peninsula Combined: 95 cases

We excluded these cases from the county-level analysis above.

In [ ]:
%%shell
jupyter nbconvert --to html 'cdc_case_data.ipynb'